Annotating Training Data for Conditional Semantic Textual Similarity Measurement using Large Language Models

Zhang, Gaifan, Zhou, Yi, Bollegala, Danushka

arXiv.org Artificial Intelligence

Semantic similarity between two sentences depends on the aspects considered between those sentences. To study this phenomenon, Deshpande et al. (2023) proposed the Conditional Semantic Textual Similarity (C-STS) task and annotated a human-rated similarity dataset containing pairs of sentences compared under two different conditions. However, Tu et al. (2024) found various annotation issues in this dataset and showed that manually re-annotating a small portion of it leads to more accurate C-STS models. Despite these pioneering efforts, the lack of large and accurately annotated C-STS datasets remains a blocker for progress on this task, as evidenced by the subpar performance of current C-STS models. To address this training data need, we resort to Large Language Models (LLMs) to correct the condition statements and similarity ratings in the original dataset proposed by Deshpande et al. (2023). Our proposed method re-annotates a large training dataset for the C-STS task with minimal manual effort. Importantly, by training a supervised C-STS model on our cleaned and re-annotated dataset, we achieve a statistically significant 5.4% improvement in Spearman correlation. The re-annotated dataset is available at https://LivNLP.github.io/CSTS-reannotation.
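
A minimal sketch of what such LLM-driven re-annotation could look like, assuming a generic `llm` text-completion callable; the prompt wording, rating scale handling, and output parsing below are illustrative, not the paper's actual pipeline:

```python
from typing import Callable

def reannotate(sentence1: str, sentence2: str, condition: str,
               llm: Callable[[str], str]) -> dict:
    """Ask an LLM to repair a C-STS condition and re-rate similarity (1-5).

    Hypothetical sketch: the real pipeline, prompts, and validation steps
    are not specified in the abstract.
    """
    prompt = (
        "You are correcting annotations for Conditional STS.\n"
        f"Sentence 1: {sentence1}\n"
        f"Sentence 2: {sentence2}\n"
        f"Condition: {condition}\n"
        "1. Rewrite the condition so it names one clear aspect of comparison.\n"
        "2. Rate the similarity of the two sentences under that condition "
        "on a 1-5 scale.\n"
        "Answer exactly as: condition=<text>; rating=<integer>"
    )
    reply = llm(prompt)
    fixed, rating = reply.split("; rating=")      # naive parse for the sketch
    return {"condition": fixed.removeprefix("condition=").strip(),
            "rating": int(rating)}
```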


Enhancing Negation Awareness in Universal Text Embeddings: A Data-efficient and Computational-efficient Approach

Cao, Hongliu

arXiv.org Artificial Intelligence

Negation plays an important role in natural language processing tasks such as Natural Language Inference and Sentiment Analysis. Numerous prior studies have found that contextual text embedding models such as BERT, ELMo, RoBERTa, and XLNet face challenges in accurately understanding negation. Recent advancements in universal text embeddings have demonstrated superior performance over contextual text embeddings on various tasks. However, due to bias in popular evaluation benchmarks, the negation awareness of these models remains unclear. To bridge this gap in the existing literature, this work presents an in-depth analysis of the negation awareness of cutting-edge universal text embedding models. Our findings reveal a significant lack of negation awareness in these models, which often interpret negated text pairs as semantically similar. To efficiently handle the conflict that different tasks need different trade-offs between topic and negation information among other semantic information, a data-efficient and computationally efficient embedding re-weighting method is proposed that does not modify the parameters of the text embedding models. The proposed solution significantly improves text embedding models' negation awareness on both simple and complex negation understanding tasks. Furthermore, it also significantly improves the negation awareness of Large Language Model based task-specific high-dimensional universal text embeddings.
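
One plausible shape for such a re-weighting (the abstract does not give the exact formulation): scale each dimension of a frozen embedding by a weight vector fitted on a small labeled set, then re-normalize. The weight values below are placeholders:

```python
import numpy as np

def reweight(embeddings: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Element-wise re-weighting of frozen embeddings, then re-normalize."""
    scaled = embeddings * w                       # (n, d) * (d,) broadcast
    norms = np.linalg.norm(scaled, axis=1, keepdims=True)
    return scaled / np.clip(norms, 1e-12, None)

# Hypothetical usage: down-weight dimensions assumed to carry topic rather
# than negation information, so negated pairs stop looking near-identical.
rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 8))
w = np.ones(8)
w[:4] = 0.3                                       # placeholder topic-heavy dims
print(reweight(emb, w).shape)                     # (4, 8)
```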


MCiteBench: A Benchmark for Multimodal Citation Text Generation in MLLMs

Hu, Caiyu, Zhang, Yikai, Zhu, Tinghui, Ye, Yiwei, Xiao, Yanghua

arXiv.org Artificial Intelligence

Multimodal Large Language Models (MLLMs) have advanced in integrating diverse modalities but frequently suffer from hallucination. A promising solution to mitigate this issue is to generate text with citations, providing a transparent chain for verification. However, existing work primarily focuses on generating citations for text-only content, overlooking the challenges and opportunities of multimodal contexts. To address this gap, we introduce MCiteBench, the first benchmark designed to evaluate and analyze the multimodal citation text generation ability of MLLMs. Our benchmark comprises data derived from academic papers and review-rebuttal interactions, featuring diverse information sources and multimodal content. We comprehensively evaluate models from multiple dimensions, including citation quality, source reliability, and answer accuracy. Through extensive experiments, we observe that MLLMs struggle with multimodal citation text generation. We also conduct deep analyses of models' performance, revealing that the bottleneck lies in attributing the correct sources rather than understanding the multimodal content.
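
As a toy illustration of the citation-quality dimension, cited sources can be scored with precision/recall against gold attributions; this simple F1 is an assumption for illustration, not MCiteBench's actual metric:

```python
def citation_f1(predicted: set, gold: set) -> float:
    """Toy F1 over cited source identifiers (illustrative metric only)."""
    if not predicted or not gold:
        return 0.0
    p = len(predicted & gold) / len(predicted)    # precision
    r = len(predicted & gold) / len(gold)         # recall
    return 2 * p * r / (p + r) if p + r else 0.0

print(round(citation_f1({"Figure 2", "Table 1"}, {"Figure 2"}), 3))  # 0.667
```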


Fundamental Principles of Linguistic Structure are Not Represented by o3

Murphy, Elliot, Leivada, Evelina, Dentella, Vittoria, Gunther, Fritz, Marcus, Gary

arXiv.org Artificial Intelligence

Instead of scaling to unprecedented levels of compute via architectures that are fundamentally grounded in token prediction, a return to more traditional design features of the human mind (predicate-argument structure, variable binding, constituent structure, minimal compositional binding; Donatelli & Koller 2023) may be needed to orchestrate a more reliable expertise in human language (Ramchand 2024). This could be implemented by forms of neuro-symbolic approaches. Still, it is also certainly true that mainstream theoretical linguistics (e.g., the minimalist enterprise) was in some ways ill-equipped to successfully predict which patterns of linguistic activity might be (un)approachable by LLMs. To illustrate, a potential weakness in recent generative grammar theorizing has been the underestimation of the extent to which lexical information drives composition. This type of information may permit LLMs to abductively infer certain elements of grammatical rules, in whatever format this ultimately takes (Ramchand 2024). Future research should more carefully apply the tools of linguistics to isolate specific sub-components of syntax that might in principle be achievable by language models, given specific design features. For instance, with LLMs "complete recovery of syntax might be very difficult computationally" (Marcolli et al. 2025: 13), even if we assume that attention modules can in principle "satisfy the same algebraic structure" as what Marcolli et al. postulate as being necessary for syntax-semantics interface mappings.


Assessing Open-Source Large Language Models on Argumentation Mining Subtasks

Abkenar, Mohammad Yeghaneh, Wang, Weixing, Graupner, Hendrik, Stede, Manfred

arXiv.org Artificial Intelligence

We explore the capability of four open-source large language models (LLMs) in argumentation mining (AM). We conduct experiments on three corpora: persuasive essays (PE) and argumentative microtexts (AMT) Part 1 and Part 2, based on two argumentation mining subtasks: (i) argumentative discourse unit classification (ADUC), and (ii) argumentative relation classification (ARC). This work aims to assess the argumentation capability of open-source LLMs, including Mistral 7B, Mixtral 8x7B, Llama 2 7B, and Llama 3 8B, in both zero-shot and few-shot scenarios. Our analysis contributes to further assessing computational argumentation with open-source LLMs in future research efforts.
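
For concreteness, a zero-shot ADUC query might be phrased as below; the label set and prompt wording are assumptions (PE-style labels), since the paper's actual prompts are not given in the abstract:

```python
ADU_LABELS = ["major claim", "claim", "premise"]  # assumed PE-style labels

def aduc_prompt(unit: str) -> str:
    """Zero-shot classification query for one discourse unit (sketch)."""
    return ("Classify the following argumentative discourse unit as one of: "
            + ", ".join(ADU_LABELS) + ".\n"
            f"Unit: {unit}\nLabel:")

print(aduc_prompt("Cloning should be banned because it is unethical."))
```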


GrammaMT: Improving Machine Translation with Grammar-Informed In-Context Learning

Ramos, Rita, Chimoto, Everlyn Asiko, ter Hoeve, Maartje, Schluter, Natalie

arXiv.org Artificial Intelligence

We introduce GrammaMT, a grammatically-aware prompting approach for machine translation that uses Interlinear Glossed Text (IGT), a common form of linguistic description providing morphological and lexical annotations for source sentences. GrammaMT proposes three prompting strategies: gloss-shot, chain-gloss, and model-gloss. All are training-free and require only a few examples that take minimal effort to collect, making them well-suited for low-resource setups. Experiments show that GrammaMT enhances translation performance on open-source instruction-tuned LLMs for various low- to high-resource languages across three benchmarks: (1) the largest IGT corpus, (2) the challenging 2023 SIGMORPHON Shared Task data over endangered languages, and (3) even an out-of-domain setting with FLORES. Moreover, ablation studies reveal that leveraging gloss resources could substantially boost MT performance (by over 17 BLEU points) if LLMs accurately generate or access input sentence glosses.
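
A sketch of how a gloss-shot prompt could be assembled, pairing each in-context example's source sentence with its interlinear gloss and translation; the field names and the Swahili example are illustrative, not taken from the paper:

```python
def gloss_shot_prompt(examples, source, gloss):
    """Build a gloss-shot prompt from (source, gloss, translation) examples."""
    parts = []
    for ex in examples:
        parts.append(f"Source: {ex['src']}\nGloss: {ex['gloss']}\n"
                     f"Translation: {ex['tgt']}\n")
    parts.append(f"Source: {source}\nGloss: {gloss}\nTranslation:")
    return "\n".join(parts)

demo = [{"src": "ninakupenda", "gloss": "1SG-PRES-2SG-love",
         "tgt": "I love you"}]
print(gloss_shot_prompt(demo, "anakupenda", "3SG-PRES-2SG-love"))
```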


Metacognitive Monitoring: A Human Ability Beyond Generative Artificial Intelligence

Huff, Markus, Ulakçı, Elanur

arXiv.org Artificial Intelligence

Large language models (LLMs) have shown impressive alignment with human cognitive processes, raising questions about the extent of their similarity to human cognition. This study investigates whether LLMs, specifically ChatGPT, possess metacognitive monitoring abilities akin to those of humans, particularly in predicting memory performance on an item-by-item basis. We employed a cross-agent prediction model to compare the metacognitive performance of humans and ChatGPT in a language-based memory task involving garden-path sentences preceded by either fitting or unfitting context sentences. Both humans and ChatGPT rated the memorability of these sentences; humans then completed a surprise recognition memory test. Our findings reveal a significant positive relationship between humans' memorability ratings and their actual recognition performance, indicating reliable metacognitive monitoring. In contrast, ChatGPT did not exhibit a similar predictive capability. Bootstrapping analyses demonstrated that none of the GPT models tested (GPT-3.5-turbo, GPT-4-turbo, GPT-4o) could accurately predict human memory performance on a per-item basis. This suggests that, despite their advanced language processing abilities and alignment with human cognition at the object level, current LLMs lack the metacognitive mechanisms that enable humans to anticipate their memory performance. These results highlight a fundamental difference between human and AI cognition at the metacognitive level. Addressing this gap is crucial for developing AI systems capable of effective self-monitoring and adaptation to human needs, thereby enhancing human-AI interactions across domains such as education and personalized learning.
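
A minimal sketch of the per-item analysis described above: correlate item-level memorability ratings with recognition accuracy and bootstrap a confidence interval. The data below are random placeholders, not the study's:

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(1)
n = 100
ratings = rng.uniform(1, 7, size=n)               # placeholder item ratings
hits = (ratings / 7 + rng.normal(0, 0.3, n) > 0.5).astype(float)

rho, _ = spearmanr(ratings, hits)                 # observed rank correlation
boot = []
for _ in range(1000):
    idx = rng.integers(0, n, n)                   # resample items with replacement
    boot.append(spearmanr(ratings[idx], hits[idx])[0])
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"rho={rho:.2f}, 95% bootstrap CI=({lo:.2f}, {hi:.2f})")
```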


Generating Synthetic Datasets for Few-shot Prompt Tuning

Guo, Xu, Du, Zilin, Li, Boyang, Miao, Chunyan

arXiv.org Artificial Intelligence

A major limitation of prompt tuning is its dependence on large labeled training datasets. Under few-shot learning settings, prompt tuning lags far behind full-model fine-tuning, limiting its scope of application. In this paper, we leverage powerful LLMs to synthesize task-specific labeled data for training soft prompts. We first introduce a distribution-aligned weighted generator tuning (DawGen) method to encourage generating in-distribution data that aligns with the few-shot real data. Then, we train soft prompts on both synthetic and real datasets using a gradient surgery approach, which eliminates the conflicting gradients from different data sources. Experiments on seven sentence-pair classification datasets demonstrate the effectiveness of our proposed method for boosting prompt tuning in few-shot learning settings. Results on the QQP, MRPC, and SICK datasets are even comparable to the performance of transfer learning from large real-world datasets, showing the promise of synthetic data as an alternative for enhancing soft prompt tuning.
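
The abstract does not detail the gradient surgery step; one plausible reading is a PCGrad-style projection: when the synthetic-data gradient conflicts with the real-data gradient (negative dot product), remove the conflicting component before combining the update:

```python
import numpy as np

def surgery(g_real: np.ndarray, g_syn: np.ndarray) -> np.ndarray:
    """PCGrad-style projection of a conflicting synthetic-data gradient."""
    dot = g_syn @ g_real
    if dot < 0:                                   # gradients conflict
        g_syn = g_syn - dot / (g_real @ g_real) * g_real
    return g_real + g_syn                         # combined update direction

g_r = np.array([1.0, 0.0])                        # real-data gradient
g_s = np.array([-0.5, 1.0])                       # conflicts along dim 0
print(surgery(g_r, g_s))                          # [1. 1.]
```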


Towards Automating Text Annotation: A Case Study on Semantic Proximity Annotation using GPT-4

Yadav, Sachin, Choppa, Tejaswi, Schlechtweg, Dominik

arXiv.org Artificial Intelligence

This paper explores using GPT-3.5 and GPT-4 to automate the data annotation process with automatic prompting techniques. The main aim of this paper is to reuse human annotation guidelines along with some annotated data to design automatic prompts for LLMs, focusing on the semantic proximity annotation task. Automatic prompts are compared to customized prompts. We further implement the prompting strategies into an open-source text annotation tool, enabling easy online use via the OpenAI API. Our study reveals the crucial role of accurate prompt design and suggests that prompting GPT-4 with human-like instructions is not straightforwardly possible for the semantic proximity task. We show that small modifications to the human guidelines already improve the performance, suggesting possible ways for future research.
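
A hypothetical sketch of assembling an automatic prompt from human annotation guidelines plus a few annotated examples; the 1-4 proximity scale and the template below are assumptions, not the tool's actual format:

```python
def build_prompt(guidelines, examples, pair):
    """Assemble an annotation prompt from guidelines and labeled examples.

    `examples` is a list of (use1, use2, score) triples; the 1-4 scale is
    an assumption borrowed from common semantic-proximity schemes.
    """
    lines = [guidelines, "", "Examples:"]
    for u1, u2, score in examples:
        lines.append(f"Use 1: {u1}\nUse 2: {u2}\nProximity (1-4): {score}")
    lines += ["", f"Use 1: {pair[0]}", f"Use 2: {pair[1]}",
              "Proximity (1-4):"]
    return "\n".join(lines)
```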


MAPO: Boosting Large Language Model Performance with Model-Adaptive Prompt Optimization

Chen, Yuyan, Wen, Zhihao, Fan, Ge, Chen, Zhengyu, Wu, Wei, Liu, Dayiheng, Li, Zhixu, Liu, Bang, Xiao, Yanghua

arXiv.org Artificial Intelligence

Prompt engineering, as an efficient and effective way to leverage Large Language Models (LLMs), has drawn a lot of attention from the research community. Existing research primarily emphasizes the importance of adapting prompts to specific tasks, rather than to specific LLMs. However, a good prompt is not solely defined by its wording; it is also bound to the nature of the LLM in question. In this work, we first quantitatively demonstrate that different prompts should be adapted to different LLMs to enhance their capabilities across various downstream tasks in NLP. Then we propose a novel model-adaptive prompt optimizer (MAPO) that optimizes the original prompts for each specific LLM on downstream tasks. Extensive experiments indicate that the proposed method can effectively refine prompts for an LLM, leading to significant improvements across various downstream tasks.
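
An illustrative model-adaptive search loop (not MAPO's actual algorithm, which the abstract does not detail): score LLM-proposed prompt rewrites on a dev set for one specific target model and keep the best:

```python
from typing import Callable, List

def adapt_prompt(base: str,
                 rewrite: Callable[[str], List[str]],
                 score: Callable[[str], float],
                 steps: int = 3) -> str:
    """Greedy search over LLM-proposed rewrites, scored for one target model."""
    best, best_score = base, score(base)
    for _ in range(steps):
        for cand in rewrite(best):                # candidate rewrites of the
            s = score(cand)                       # current best prompt
            if s > best_score:
                best, best_score = cand, s
    return best
```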